Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Abstract
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
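The graph representation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_graph`, `shared_edges`), the begin/end marker vertices, and the whitespace tokenization are all assumptions made for illustration, and a raw count of paths sharing an edge stands in for the recursive context-sensitive statistical inference the paper uses to decide which patterns are significant.

```python
from collections import defaultdict

# Hypothetical marker vertices for the start and end of every sentence path.
BEGIN, END = "<begin>", "<end>"


def build_graph(sentences):
    """Represent each sentence as a path over word vertices.

    Returns (paths, edges): paths is a list of vertex lists, and edges maps
    each directed edge (u, v) to the set of sentence indices that traverse it.
    """
    edges = defaultdict(set)
    paths = []
    for idx, sentence in enumerate(sentences):
        # Assumption: lowercase, whitespace-separated words are the vertices.
        path = [BEGIN] + sentence.lower().split() + [END]
        paths.append(path)
        for u, v in zip(path, path[1:]):
            edges[(u, v)].add(idx)
    return paths, edges


def shared_edges(edges, min_paths=2):
    """Keep edges traversed by at least min_paths distinct sentences.

    This simple threshold is only a placeholder for a real significance test.
    """
    return {edge: ids for edge, ids in edges.items() if len(ids) >= min_paths}
```

Calling `build_graph` on a handful of short sentences and passing the resulting edges to `shared_edges` shows which word transitions recur across sentences, which is the kind of overlap a pattern-acquisition step would then have to evaluate.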
Related resources
How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs
Many elements contribute to the relative difficulty of acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might) are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...
Unsupervised language acquisition: syntax from plain corpus
We describe results of a novel algorithm for grammar induction from a large corpus. The ADIOS (Automatic DIstillation of Structure) algorithm searches for significant patterns, chosen according to context dependent statistical criteria, and builds a hierarchy of such patterns according to a set of rules leading to structured generalization. The corpus is thus generalized into a context free gra...
Rich Syntax from a Raw Corpus: Unsupervised Does It
We compare our model of unsupervised learning of linguistic structures, ADIOS [1], to some recent work in computational linguistics and in grammar theory. Our approach resembles the Construction Grammar in its general philosophy (e.g., in its reliance on structural generalizations rather than on syntax projected by the lexicon, as in the current generative theories), and the Tree Adjoining Gram...
Bridging computational, formal and psycholinguistic approaches to language
We compare our model of unsupervised learning of linguistic structures, ADIOS [1, 2, 3], to some recent work in computational linguistics and in grammar theory. Our approach resembles the Construction Grammar in its general philosophy (e.g., in its reliance on structural generalizations rather than on syntax projected by the lexicon, as in the current generative theories), and the Tree Adjoinin...
Unsupervised Learning of Word Boundary with Description Length Gain
This paper presents an unsupervised approach to lexical acquisition that uses description length gain (DLG) as its goodness measure, formulated following classic information theory within the minimum description length (MDL) paradigm. The learning algorithm seeks an optimal segmentation of an utterance that maximises the description length gain from the individual segments. The resultant segments sh...
Publication date: 2003